1 Background

“It’s Friday night!”

Customer behaviour, especially in the food and beverage industry, is strongly tied to seasonal patterns. The owner wants to analyze the number of visitors (including dine-in, delivery, and takeaway transactions) so he can make better judgments in 2018. Fortunately, time series analysis is well suited to providing both a good forecast and an explanation of the seasonality.

2 Working Data & Library

Library Usage :

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate) 
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyr)
library(ggplot2)
library(TSstudio) 
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(readr)
library(padr)
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(ggsci)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Read the source files: Train Data (data-train.csv) and Test Data (data-test.csv).

fnb <- read_csv("data-train.csv")
## Rows: 137748 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): receipt_number, item_id, item_group, item_major_group, payment_typ...
## dbl  (3): quantity, price_usd, total_usd
## dttm (1): transaction_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

3 Data Structure review

View the Train Data; it will be used for training and validation.

head(fnb,3)
## # A tibble: 3 × 10
##   transaction_date    receipt_…¹ item_id item_…² item_…³ quant…⁴ price…⁵ total…⁶
##   <dttm>              <chr>      <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl>
## 1 2017-12-01 13:32:46 A0026694   I10100… noodle… food          1    7.33    7.33
## 2 2017-12-01 13:32:46 A0026694   I10500… drinks  bevera…       1    4.12    4.12
## 3 2017-12-01 13:33:39 A0026695   I10500… drinks  bevera…       1    2.02    2.02
## # … with 2 more variables: payment_type <chr>, sales_type <chr>, and
## #   abbreviated variable names ¹​receipt_number, ²​item_group, ³​item_major_group,
## #   ⁴​quantity, ⁵​price_usd, ⁶​total_usd

The dataset includes information about:

- transaction_date: The timestamp of a transaction

- receipt_number: The ID of a transaction

- item_id : The ID of an item in a transaction

- item_group : The group ID of an item in a transaction

- item_major_group : The major-group ID of an item in a transaction

- quantity : The quantity of purchased item

- price_usd : The price of purchased item

- total_usd : The total price of purchased item

- payment_type : The payment method

- sales_type : The sales method

4 Data Preprocess

Below are the preprocessing steps that need to be done before conducting further analysis. First, round each datetime down to the hour using floor_date() and save the result into a ds column:

fnb_clean <- fnb %>% 
  mutate(transaction_date=ymd_hms(transaction_date),
         ds = floor_date(transaction_date, unit = "hours")) %>%
  select(transaction_date, receipt_number, ds)

Group by the ds column (datetime in hours) and count the distinct receipt_number values to get the number of visitors:

fnb_clean <- fnb_clean %>% group_by(ds) %>% summarise(visitor = n_distinct(receipt_number)) 

Apply time series padding to fill in the missing hours, so that each day of the series consists of 24 records (24 hours per day):

fnb_clean <- fnb_clean %>% pad()
## pad applied on the interval: hour
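Conceptually, pad() builds the complete hourly grid and joins the observed values onto it. A minimal base-R sketch of the same idea (toy timestamps, not the actual data):

```r
# Base-R sketch of what padr::pad() does for an hourly series:
# build the full hourly grid, then left-join the observed rows,
# so the missing hour appears as NA.
obs <- data.frame(
  ds = as.POSIXct(c("2017-12-01 13:00", "2017-12-01 15:00"), tz = "UTC"),
  visitor = c(16, 27)
)
grid <- data.frame(ds = seq(min(obs$ds), max(obs$ds), by = "hour"))
padded <- merge(grid, obs, all.x = TRUE)
padded  # 3 rows; the 14:00 row has visitor = NA
```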

Replace the NA values with 0:

fnb_clean <- fnb_clean %>% mutate(visitor = na.fill(visitor, 0))
fnb_clean
## # A tibble: 1,907 × 2
##    ds                  visitor
##    <dttm>                <int>
##  1 2017-12-01 13:00:00      16
##  2 2017-12-01 14:00:00      38
##  3 2017-12-01 15:00:00      27
##  4 2017-12-01 16:00:00      29
##  5 2017-12-01 17:00:00      44
##  6 2017-12-01 18:00:00      50
##  7 2017-12-01 19:00:00      66
##  8 2017-12-01 20:00:00      70
##  9 2017-12-01 21:00:00      63
## 10 2017-12-01 22:00:00      63
## # … with 1,897 more rows

Check for NA values; no further NAs are found:

# Check NA 
colSums(is.na(fnb_clean))
##      ds visitor 
##       0       0

Filter the hourly datetime (ds) to the outlet's opening hours, from 10.00 to 22.00:

fnb_clean <- fnb_clean %>% filter(hour(ds) >=10 & hour(ds) <=22)

Check the start and end of the time interval after filtering:

range(fnb_clean$ds)
## [1] "2017-12-01 13:00:00 UTC" "2018-02-18 22:00:00 UTC"
tail(fnb_clean,10)
## # A tibble: 10 × 2
##    ds                  visitor
##    <dttm>                <int>
##  1 2018-02-18 13:00:00      25
##  2 2018-02-18 14:00:00      25
##  3 2018-02-18 15:00:00      27
##  4 2018-02-18 16:00:00      34
##  5 2018-02-18 17:00:00      26
##  6 2018-02-18 18:00:00      31
##  7 2018-02-18 19:00:00      56
##  8 2018-02-18 20:00:00      67
##  9 2018-02-18 21:00:00      49
## 10 2018-02-18 22:00:00      41

5 Seasonality Analysis

5.1 Observation & Decomposition

Create a time series object using the ts() function with frequency 13, where 13 is the number of hours the outlet is open each day, from 10.00 to 22.00.

fnb_ts <- ts(fnb_clean$visitor, frequency = 13)

Decompose and plot to inspect the seasonal, trend, and irregular components of the time series:

fnb_ts_dec <- fnb_ts %>% decompose(type = "additive")

fnb_ts %>% tail(13*7*4) %>%  decompose() %>% autoplot()

Explanation :

From the plot above, the estimated trend component still shows a repeating pattern, which suggests it contains un-captured extra seasonality; the data can therefore be considered multi-seasonal. To handle multi-seasonality, we need to convert the data into a “Multiple Seasonality Time Series”, which accepts multiple frequency settings.

Create the “Multi-seasonality Time Series” object using the msts() function with two frequencies: daily (13) and weekly (13*7). This captures both daily and weekly seasonality. Then decompose and plot.

fnb_msts <- msts(data = fnb_clean$visitor,seasonal.periods = c(13,13*7))

fnb_msts_dec <- mstl(fnb_msts)

fnb_msts %>% tail(13*7*4) %>% stl(s.window = "periodic") %>% autoplot()

Explanation :

Based on the plot above, the estimated trend from the “Multiple Seasonality Time Series” is now smoother and clearer. The daily and weekly seasonality components are also clearer, making them more explainable for further analysis.

5.2 Seasonality by Hour

Create a visualization of seasonality by hour from the standard time series object ts() and provide an explanation.

fnb_clean %>% 
  mutate(Hour = hour(ds), Seasonal = fnb_ts_dec$seasonal) %>% 
  distinct(Hour, Seasonal) %>% 
  ggplot() +
  geom_bar(aes(x = Hour, y = Seasonal, fill = Seasonal), stat ="identity", width = 0.7)+
  scale_fill_gradient(low = "black", high = "red") +
  scale_x_continuous(breaks = seq(10,22,1)) +
  labs(title = "Seasonality Analysis  - Hourly")

Explanation :

Based on the plot above, the outlet's peak time is between 19.00 and 22.00, with 20.00 being the hour when the most visitors come. The fewest visitors come at 10.00, when the outlet has just opened.

5.3 Seasonality by Week & Hour

Create a visualization of seasonality by day of week and hour from the multi-seasonal time series object msts() and provide an explanation.

fnb_msts_dec_fr <- data.frame(fnb_msts_dec)

fnb_msts_dec_fr %>%
  mutate(ds = fnb_clean$ds) %>% 
  mutate(Day_of_Week  = wday(ds, label = TRUE, abbr = FALSE), Hour = (hour(ds))) %>% 
  group_by(Day_of_Week, Hour) %>%
  summarise(Seasonal = sum(Seasonal13 + Seasonal91)) %>%
  ggplot() +
  geom_bar(aes(x = Hour, y = Seasonal, fill = Day_of_Week),stat ="identity", position = "stack", width = 0.7)+
  scale_x_continuous(breaks = seq(10,22,1)) +
  scale_fill_locuszoom()+
  labs(title = "Multi-Seasonality Analysis  - Weekly & Hourly")
## `summarise()` has grouped output by 'Day_of_Week'. You can override using the
## `.groups` argument.

Explanation :

Based on the plot above, the outlet's peak time is between 19.00 and 22.00 on every day of the week, and Saturday at 20.00 is when the most visitors come. The fewest visitors come at 10.00, on every day of the week, when the outlet has just opened.

6 Model Fitting and Evaluation

6.1 Cross Validation

We will split our train data into two sets named `train` and `validation`. The `validation` data will consist of the last seven days of the restaurant's operational hours.

# Cross Validation

fnb_test_msts <- tail(fnb_clean, 13*7)

fnb_test_msts$visitor <- round(fnb_test_msts$visitor)

fnb_train_msts <- head(fnb_clean, nrow(fnb_clean) - 13*7)

fnb_train_msts$visitor <- round(fnb_train_msts$visitor)

# Plot data Train & Validation 

Plot <- fnb_train_msts %>%
   ggplot(aes(x = ds, y = visitor)) +
   geom_line() +
   scale_x_datetime(name = "Transaction Date", 
                    date_breaks = "2 weeks", 
                    expand = expansion(mult = 0.05, add = 0.05)) +
   scale_y_continuous(breaks = seq(0, 400, 50), expand = expansion(mult = 0.05, add = 0.05)) +
   geom_line(data = fnb_test_msts, 
             aes(color = "red"), 
             show.legend = T)
  
ggplotly(Plot)

The graphic above shows us the composition of our data:

- train data: from 2017-12-01 to 2018-02-11
- validation data: from 2018-02-12 to 2018-02-18
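The head()/tail() split above can be illustrated on a toy vector: the last 13 * 7 = 91 hourly observations (one week of operating hours) form the validation window. A sketch, not the actual data:

```r
# Sketch of the cross-validation split: the last 13 * 7 = 91 points
# become validation, the rest stay in train.
n_valid <- 13 * 7
x <- seq_len(200)                      # stand-in for the visitor series
train <- head(x, length(x) - n_valid)
valid <- tail(x, n_valid)
c(length(train), length(valid))        # 109 and 91
```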

6.2 Modelling

After successfully creating the MSTS object, we will fit it to several models. The models we will try are:

- Multi-Seasonal ARIMA model
- Multi-Seasonal Holt-Winter
- TBATS Model
  1. Multi-Seasonal ARIMA:
# Create Model
model_arima <- stlm(fnb_msts, method = "arima")
  2. Multi-seasonal Holt-Winter:
# Create Model
model_hw <- stlm(fnb_msts, method = "ets")
  3. TBATS model:
# Create Model

 model_tbats <- fnb_msts %>%
   tbats(use.box.cox = F,
         use.trend = T,
         use.damped.trend = T)

6.3 Forecast

forecast_arima <- forecast(model_arima, h = 13*7)
forecast_hw <- forecast(model_hw, h = 13*7)
forecast_tbats <- forecast(model_tbats, h = 13*7)

6.4 Visualization

1. Multi-Seasonal ARIMA:

fnb_msts %>% 
  autoplot(series = "actual") +
  autolayer(forecast_arima$fitted, series = "train") +
  autolayer(forecast_arima$mean, series = "test") +
  theme_minimal()

2. Multi-seasonal Holt-Winter

fnb_msts %>% 
  autoplot(series = "actual") +
  autolayer(forecast_hw$fitted, series = "train") +
  autolayer(forecast_hw$mean, series = "test") +
  theme_minimal()

3. TBATS model

fnb_msts %>% 
  autoplot(series = "actual") +
  autolayer(forecast_tbats$fitted, series = "train") +
  autolayer(forecast_tbats$mean, series = "test") +
  theme_minimal()

6.5 Compare accuracy

Evaluate the models' performance:

modelacc <- rbind(
  accuracy(forecast_arima$mean, fnb_test_msts$visitor),
  accuracy(forecast_hw$mean, fnb_test_msts$visitor),
  accuracy(forecast_tbats$mean, fnb_test_msts$visitor))

rownames(modelacc) <- c("Multi-Seasonal ARIMA", "Multi-seasonal Holt-Winter", "TBATS model" )
modelacc
##                                   ME     RMSE      MAE MPE MAPE
## Multi-Seasonal ARIMA       -1.384508 5.905591 4.569183 NaN  Inf
## Multi-seasonal Holt-Winter  2.104726 6.000170 4.792508 Inf  Inf
## TBATS model                -1.026611 6.679003 5.370299 NaN  Inf

Based on the accuracy summary above, the “Multi-Seasonal ARIMA” model performs best, with an MAE of 4.569183 (lower than 6). This model will be chosen for prediction: the best model is the ARIMA model.
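As a sanity check on how the MAE column is computed, the metric can be reproduced directly (illustrative toy vectors, not the document's forecasts):

```r
# MAE = mean absolute difference between forecast and actual,
# the same quantity reported by forecast::accuracy().
actual <- c(50, 60, 45, 70)
pred   <- c(48, 63, 44, 66)
mae <- mean(abs(actual - pred))
mae  # 2.5
```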

7 Visualization Actual vs Estimated

Data Preparation for the Plot :

accuracyData <- data.frame(ds= fnb_clean$ds %>% tail(13*7),
  Actual = as.vector(fnb_test_msts) ,
  TBATSForecast = as.vector(forecast_tbats$mean),
  ArimaForecast = as.vector(forecast_arima$mean),
  HWForecast = as.vector(forecast_hw$mean))

7.1 Visualization of Actual vs Estimated number of visitors (Best Model)

accuracyData %>% 
 ggplot() +
  geom_line(aes(x = ds, y = Actual.visitor, colour = "Actual"), linewidth = 1)+
  geom_line(aes(x = ds, y = ArimaForecast, colour = "Multi-Seasonal ARIMA Model"), linewidth = 1)+
  labs(title = "Hourly Visitor - Actual vs Multi-Seasonal ARIMA Model", x = "Date", y = "Visitor", colour = "")

7.2 Visualization of Actual vs Estimated number of visitors (All Models)

accuracyData %>% 
 ggplot() +
  geom_line(aes(x = ds, y = Actual.visitor, colour = "Actual"), linewidth = 0.5)+
  geom_line(aes(x = ds, y = HWForecast, colour = "Holt Winter Model"), linewidth = 0.1)+
  geom_line(aes(x = ds, y = ArimaForecast, colour = "Arima Model (Best Model)"), linewidth = 0.5)+
  geom_line(aes(x = ds, y = TBATSForecast, colour = "TBATS Model"), linewidth = 0.5)+
  labs(title = "Hourly Visitor - Actual vs All Models", x = "Date", y = "Visitor", colour = "")

8 Prediction Performance

8.1 MAE < 6 in validation dataset

The best model is the ARIMA model, as it has an MAE of 4.569183, the smallest MAE compared with the other models.

accuracy(forecast_arima$mean, fnb_test_msts$visitor)
##                 ME     RMSE      MAE MPE MAPE
## Test set -1.384508 5.905591 4.569183 NaN  Inf

8.2 MAE < 6 in test dataset

Predict using the multi-seasonal data and save the prediction into the CSV file submission-destian.csv using the best model, the ARIMA model. The steps:

knitr::include_graphics("MAE.png")

fnb_test  <- read_csv("data-test.csv")
## Rows: 91 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## lgl  (1): visitor
## dttm (1): datetime
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fnb_test$visitor <- forecast_arima$mean
write.csv(fnb_test, "submission-destian.csv", row.names = F)

9 Summary

9.1 Assumptions

We will check the autocorrelation and normality of the model residuals with the same tests we used previously.

  1. Autocorrelation test
# Residual Autocorrelation
   
Box.test(model_arima$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  model_arima$residuals
## X-squared = 0.00010908, df = 1, p-value = 0.9917

Conclusion: the p-value > 0.05, so we can assume there is no autocorrelation in the residuals.

  2. Normality of residuals
shapiro.test(model_arima$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_arima$residuals
## W = 0.99139, p-value = 9.359e-06
hist(model_arima$residual, breaks = 30)

p-value > 0.05 : Residual normally distributed

p-value < 0.05 : Residual not normally distributed

Since the p-value < 0.05, we conclude that the residuals are not normally distributed. This can happen because the dataset used to build the model is not large enough; it does not mean the model has poor forecasting performance. In this case, more data could be collected to build the model, so that the residuals become normally distributed and forecasting performance improves. In addition, because the normality assumption is not met, the error is not constant at 0, so forecasts over a longer horizon will tend to have larger errors. To mitigate this, the forecasting model should be re-built whenever new historical data becomes available.
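For illustration, the Shapiro-Wilk test rejects normality for any clearly skewed series, just as it did for the model residuals. A self-contained sketch on simulated (non-normal) residuals:

```r
# Shapiro-Wilk on deliberately skewed toy "residuals": the tiny
# p-value leads us to reject normality, as with the model above.
set.seed(1)
res <- rexp(500) - 1          # skewed around 0, stand-in for residuals
shapiro.test(res)$p.value     # far below 0.05
```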

  3. Highest visitor
fnb_test %>% 
   mutate(hour = hour(datetime),
          seasonal = as.numeric(visitor),
          wday = wday(datetime, label = T, abbr = T)) %>% 
   ggplot(aes(x = hour, y = seasonal))+
   geom_col(aes(fill = wday))+
   labs(title = "Seasonality across hour and weekdays", x = "Hour", y = "Visitors")+
   theme_minimal()+
   theme(legend.position = "right")

The most visitors come to the restaurant at 20.00 WIB, and the busiest day is Saturday.